This lab should give you a chance to work on a real database tuning problem. The problem is related to performance of a large linguistic database. For this lab, you are given only one table (~170 MB), and some of the typical queries using this table. You must try to improve performance of the queries. Some important things: our client (Prof. William Fletcher) uses a MySQL database. The database is not updated frequently (in fact, I think there are no updates, just selects). The database can potentially be used by a large number of users concurrently. They are accessing the database through the web interface provided (see the link to the n-gram query interface below). You can play with the interface yourself, to see how it works and better understand the application.
For this lab, you should first re-create the database used by our client (execute the CREATE TABLE statement, then download the data file provided and load it into the table, by using LOAD DATA INFILE). Then, start working on tuning the database to improve the performance. Document the things you try, so I can grade your work. Please provide a set of recommendations for improving the performance of the database, and the rationale for those changes.
Below is the information provided by our client.
My "Phrases in English" site has i.a. word n-gram tables of all sequences of n words occurring 3 or more times for n in the range of 1-8. This link is to the n-gram query interface on the development site which sets a flag to display the SQL queries used. Try various combinations of words (multiple space-separated words in a single field are ORed), wildcards, exclude etc. to explore the various queries. A list of typical SQL queries is given at the end. A list of actual users queries is also provided. Notes:
Descriptions of how the data were normalized / preprocessed and of the PoS (part-of-speech) tags are in the documentation (see menu on query page).
The MySQL CREATE TABLE statement for the 3-gram table is shown
below. Note that the indices were added after the data were read into the
database with the LOAD DATA INFILE statement.
Since users typically want the most frequent n-grams, I resorted the table by descending frequency, which explains the haphazard ID sequence. This really bumped up performance from the original alphabetized table.
Some queries may match 50-100 k rows, but most users leave the default of 1k rows in the results.
One colleague has a query interface to the same dataset that uses a very different (ultra-denormalized) approach, described here on p. 6ff. I have tried something similar, but queries simply took forever with MySQL on our server. He uses SQL Server on a dedicated machine.
Any suggestions for improving performance before I update this database and create a new much larger one (see http://webascorpus.org/searchwc.html, currently untagged) would be most welcome.
AssocProf William H. Fletcher, x3-6362, fletcher@usna.edu.
Download the 3gram
data here (163 MB -- don't display in browser; use Save as... Uses
MySQL default row / column delimiters for LOAD DATA INFILE)
CREATE TABLE `3grams` (
`3gramcount` smallint(5) unsigned NOT NULL default '0',
`3gram` varchar(90) NOT NULL default '',
`3grampos` varchar(11) NOT NULL default '',
`3gramid` mediumint(8) unsigned NOT NULL default '0',
`3grampos1` enum('AJ0','AJC','AJS','AV0','AVP','AVQ','CJC','CJS','CJT','CRD','ORD','AT0','DPS','DT0','DTQ','NN0','NN1','NN2','NP0','PNI','PNP','PNQ','PNX','POS','PRF','PRP','VBB','VBD','VBG','VBI','VBN','VBZ','VDB','VDD','VDG','VDI','VDN','VDZ','VHB','VHD','VHG','VHI','VHN','VHZ','VM0','VVB','VVD','VVG','VVI','VVN','VVZ','EX0','ITJ','TO0','UNC','XX0','ZZ0','-*-') NOT NULL default 'AJ0',
`3grampos2` enum('AJ0','AJC','AJS','AV0','AVP','AVQ','CJC','CJS','CJT','CRD','ORD','AT0','DPS','DT0','DTQ','NN0','NN1','NN2','NP0','PNI','PNP','PNQ','PNX','POS','PRF','PRP','VBB','VBD','VBG','VBI','VBN','VBZ','VDB','VDD','VDG','VDI','VDN','VDZ','VHB','VHD','VHG','VHI','VHN','VHZ','VM0','VVB','VVD','VVG','VVI','VVN','VVZ','EX0','ITJ','TO0','UNC','XX0','ZZ0','-*-') NOT NULL default 'AJ0',
`3grampos3` enum('AJ0','AJC','AJS','AV0','AVP','AVQ','CJC','CJS','CJT','CRD','ORD','AT0','DPS','DT0','DTQ','NN0','NN1','NN2','NP0','PNI','PNP','PNQ','PNX','POS','PRF','PRP','VBB','VBD','VBG','VBI','VBN','VBZ','VDB','VDD','VDG','VDI','VDN','VDZ','VHB','VHD','VHG','VHI','VHN','VHZ','VM0','VVB','VVD','VVG','VVI','VVN','VVZ','EX0','ITJ','TO0','UNC','XX0','ZZ0','-*-') NOT NULL default 'AJ0',
PRIMARY KEY (`3gramid`),
KEY `g3` (`3gram`),
KEY `gc3` (`3gramcount`),
KEY `3gramp1` (`3grampos1`),
KEY `3gramp2` (`3grampos2`),
KEY `3gramp3` (`3grampos3`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Here are some typical SQL queries for this table (the values given are somples of possible values)
Other possible queries:
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=1 AND ( 3gram LIKE '% determine %') ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=1 AND ( 3gram LIKE 'go on %') AND ( 3grampos1 LIKE 'V_D' ) AND ( 3grampos2 NOT LIKE 'PR_' ) AND ( 3grampos3 LIKE 'V_G' ) ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=20 AND ( 3gram LIKE 'keep % %') AND ( 3grampos2 LIKE 'AV_' ) AND ( 3grampos3 LIKE 'V_G' ) ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=20 AND ( 3gram LIKE 'take part in') AND ( 3grampos1 LIKE 'VV_' ) ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=20 AND ( 3gram LIKE 'take part in') ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=20 AND ( 3gram LIKE 'take part %' OR 3gram LIKE 'took part %') ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=20 AND ( 3gram LIKE 'like %to %') ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000
- SELECT SQL_CALC_FOUND_ROWS 3gram, 3gramcount , 3grampos FROM 3grams WHERE 3gramcount>=20 AND ( 3gram LIKE 'like %to %') AND ( 3grampos1 LIKE 'V_B' ) AND ( 3grampos2 LIKE 'V_I' ) ORDER BY 3gramcount DESC, 3gram LIMIT 0, 1000